WEEK 9: LINEAR REGRESSION & SIMULATING DISTRIBUTIONS
Monday, March 6th
Today we will…
Questions from Lab/Challenge 8
Final Project Part 2 – Linear Regression
Mini lecture on text material
Review of Simple Linear Regression
Conditions of Linear Regression
PA 9.1: Mystery Animal
Lab 9: Baby Names
NC Births Data
The ncbirths data set is a random sample of 1,000 cases taken from a larger data set collected in North Carolina in 2004.
Each case describes the birth of a single child born in North Carolina, along with various characteristics of the child (e.g. birth weight, length of gestation, etc.), the child’s mother (e.g. age, weight gained during pregnancy, smoking habits, etc.) and the child’s father (e.g. age).
library(openintro)
data(ncbirths)

head(ncbirths) |>
  knitr::kable() |>
  kableExtra::scroll_box(height = "200px")
| fage | mage | mature | weeks | premie | visits | marital | gained | weight | lowbirthweight | gender | habit | whitemom |
|------|------|-------------|-------|-----------|--------|-------------|--------|--------|----------------|--------|-----------|-----------|
| NA | 13 | younger mom | 39 | full term | 10 | not married | 38 | 7.63 | not low | male | nonsmoker | not white |
| NA | 14 | younger mom | 42 | full term | 15 | not married | 20 | 7.88 | not low | male | nonsmoker | not white |
| 19 | 15 | younger mom | 37 | full term | 11 | not married | 38 | 6.63 | not low | female | nonsmoker | white |
| 21 | 15 | younger mom | 41 | full term | 6 | not married | 34 | 8.00 | not low | male | nonsmoker | white |
| NA | 15 | younger mom | 39 | full term | 9 | not married | 27 | 6.38 | not low | female | nonsmoker | not white |
| NA | 15 | younger mom | 38 | full term | 19 | not married | 22 | 5.38 | low | male | nonsmoker | not white |
Relationships Between Variables
In a statistical model, we generally have one variable that is the output and one or more variables that are the inputs.
Response variable
a.k.a. \(y\) , dependent
The quantity you want to understand
Explanatory variable
a.k.a. \(x\), independent, predictor
Something you think might be related to the response
Visualizing the Relationship
The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”
Characterizing relationships
Form – linear, quadratic, non-linear?
Direction – positive, negative?
Strength – how much scatter/noise?
Unusual observations – do points not fit the overall pattern?
Your turn!
How would you characterize this relationship?
shape
direction
strength
outliers
As we work through these, please keep in mind that much of what we are doing at this stage involves making judgment calls. This is part of the nature of statistics, and while it can be frustrating - especially as a beginner - it is inescapable. For better or for worse, statistics is not a field where there is one right answer. There are of course an infinite number of indefensible claims, but many judgments are open to interpretation.
There isn’t a universal, hard-and-fast definition of what constitutes an outlier, but they are often easy to spot in a scatterplot.
Simple Linear Regression (SLR)
Assume the relationship between \(x\) and \(y\) takes the form of a linear function.
\[
response = intercept + slope \cdot explanatory + noise
\]
Population Regression Model
\(Y_i = \beta_0 + \beta_1 \cdot X_i + \epsilon_i\)
\(\epsilon_i \sim N(0, \sigma)\)
Fitted Regression Model (Sample)
\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x\)
Models are ubiquitous in statistics. In many cases, we assume that the value of our response variable is some function of our explanatory variable, plus some random noise.
What we are saying here is that there is some mathematical function \(f\) , which can translate values of one variable into values of another, except that there is some randomness in the process. What often distinguishes statisticians from other quantitative researchers is the way that we try to model that random noise.
Fitting an SLR Model
ncbirths |>
  ggplot(aes(x = gained,
             y = weight)) +
  geom_jitter() +
  geom_smooth(method = "lm")

ncbirth_lm <- lm(weight ~ gained,
                 data = ncbirths)

library(broom)
tidy(ncbirth_lm)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.63 0.111 59.6 0
2 gained 0.0161 0.00332 4.86 0.00000135
Intercept: Expected mean value of \(y\) when \(x\) is 0
Slope: Expected change in the mean of \(y\) when \(x\) increases by 1 unit
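As a quick worked check of these interpretations (using the rounded estimates from the tidy() output above), the predicted birth weight for a mother who gained 30 pounds is:

```r
b0 <- 6.63    # intercept estimate (rounded)
b1 <- 0.0161  # slope estimate for gained (rounded)

# predicted mean birth weight (lbs) when gained = 30
b0 + b1 * 30
#> [1] 7.113
```

In practice you would use `predict(ncbirth_lm, newdata = data.frame(gained = 30))`, which uses the unrounded coefficients.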
Residuals: the difference between the observed value (the point) and the expected value (the line).
ncbirth_lm$residuals |>
  head(3)
1 2 3
0.3866279 0.9271567 -0.6133721
resid(ncbirth_lm) |>
  head(3)
1 2 3
0.3866279 0.9271567 -0.6133721
library(broom)
augment(ncbirth_lm) |>
  head(3)
# A tibble: 3 × 9
.rownames weight gained .fitted .resid .hat .sigma .cooksd .std.resid
<chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 7.63 38 7.24 0.387 0.00133 1.47 0.0000458 0.262
2 2 7.88 20 6.95 0.927 0.00157 1.47 0.000311 0.630
3 3 6.63 38 7.24 -0.613 0.00133 1.47 0.000115 -0.416
Model Diagnostics
Linear Relationship
Almost nothing you explore will look perfectly linear
Be careful with relationships that have curvature
Variable transformations can often help
Independence of observations
Nearly every statistical model you will encounter requires the observations in the data to be independent.
This is often the most crucial but also the most difficult to diagnose.
What does independence mean?
Independence means that I should not be able to know the \(y\) value for one observation based on knowing the \(y\) value of another observation.
Situations with independence violations
If there is an inherent grouping of observations, then independence may be violated.
Stock market prices over time
Geographical similarities
Biological conditions of family members
Repeated observations
Normally distributed residuals
Observations vary symmetrically around the least squares line, spreading out in a bell-shaped fashion.
Less important than linearity or independence for a few reasons:
Least squares still gives unbiased estimates of the true population model.
With larger sample sizes, inference becomes reliable even when the residuals are not perfectly normal (by the Central Limit Theorem).
Equal (constant) variance
The variability of points around the least squares line remains roughly constant.
When the variance is not equal across the range of x-values, the model can seriously mis-estimate the variability of the slope.
ncbirth_lm |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)  # reference line at zero residual
Assessing Model Fit
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
gained 1 51.36 51.357 23.642 1.353e-06 ***
Residuals 971 2109.32 2.172
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = weight ~ gained, data = ncbirths)
Residuals:
Min 1Q Median 3Q Max
-6.4085 -0.6950 0.1643 0.9222 4.5158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.63003 0.11120 59.620 < 2e-16 ***
gained 0.01614 0.00332 4.862 1.35e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.474 on 971 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.02377, Adjusted R-squared: 0.02276
F-statistic: 23.64 on 1 and 971 DF, p-value: 1.353e-06
Sum of Square Errors (SSE)
Root Mean Square Error (RMSE)
standard deviation of residuals
R-squared
proportion of variability in response accounted for by the linear model
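All three quantities can be computed directly from a fitted model's residuals. A minimal sketch using the built-in mtcars data (so it runs without the openintro package; the same recipe applies to ncbirth_lm):

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Sum of Squared Errors: total squared residual
sse <- sum(resid(fit)^2)

# RMSE as lm() reports it: residual standard error, using n - 2 df
rmse <- sqrt(sse / df.residual(fit))

# R-squared: 1 minus SSE over the total sum of squares
rsq <- 1 - sse / sum((mtcars$mpg - mean(mtcars$mpg))^2)

c(SSE = sse, RMSE = rmse, R2 = rsq)
```

The rsq value computed this way matches summary(fit)$r.squared, and rmse matches the "Residual standard error" line of summary(fit).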
Model Comparison
Weight as explanatory variable
weight_weeks <- lm(weight ~ weeks,
                   data = ncbirths)
SSE = 1246.55
RMSE = 1.119
\(R^2\) = 0.449
Visit as explanatory variable
weight_visits <- lm(weight ~ visits,
                    data = ncbirths)
SSE = 2152.74
RMSE = 1.475
\(R^2\) = 0.01819
Extending to Multiple Linear Regression
If you have multiple explanatory variables you want to include in the model:
lm(y ~ x1 + x2 + x3 + ... + xn)
If you want to include interactions:
lm(y ~ x1 + x2 + x1:x2)
or use the shortcut:
lm(y ~ x1*x2)
You can include quadratic relationships (recall \(ax^2 + bx + c\)):
lm(y ~ I(x1^2) + x1)
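To see that the x1*x2 shortcut really expands to both main effects plus the interaction, you can inspect the fitted coefficients. A sketch using made-up data (the columns y, x1, x2 are hypothetical):

```r
set.seed(42)  # arbitrary seed so the toy data is reproducible
d <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))

fit <- lm(y ~ x1 * x2, data = d)

# The shortcut produces main effects plus the interaction term:
names(coef(fit))
#> [1] "(Intercept)" "x1"          "x2"          "x1:x2"
```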
A quick note about piping |> and _
Instead of the . or .x placeholder used with %>%, the native pipe |> uses _, and the argument it fills must be named.
weight_fullterm <- ncbirths |>
  filter(premie == "full term") |>
  lm(weight ~ weeks,
     data = _)
To do…
PA 9.1: Mystery Animal
Due Wednesday, 3/8 at 8:00am
Lab 9: Baby Names
Due Friday, 3/10 at 11:59pm
Challenge 9: Formatting Nice Tables
Due Saturday, 3/11 at 11:59pm
Project Check-point 2: Linear Regression
Due Sunday, 3/12 at 11:59pm
Wednesday, March 8th
Today we will…
Mini lecture on text material
Worktime
PA 9.2: Instrument Con
Lab & Challenge 9
Project: Linear Regression
Statistical Distributions
Recall from your previous statistics classes…
A random variable is a value we don’t know until we take a sample .
Coin flip: Could be heads (0) or tails (1)
Person’s height: Could be anything from 0 feet to 20 feet.
Annual income of a US worker: $0 to $1.6 billion
The distribution of the random variable tells us the possible values and how likely they are
Coin flip has 50% chance of heads, 50% chance of tails
The likelihood of heights follows a bell-curve shape around 5 feet 7 inches.
Most American workers make under $100,000
Some distributions have names
For this class, you need to know a few of them.
Uniform Distribution
When you know the range , but not much else.
All values in the range are equally likely to occur.
Normal Distribution
When you expect values to fall near the center .
Frequency of values follows a bell curve shape.
t-Distribution
Slightly wider bell curve shape.
Used in basically the same contexts as the Normal distribution, but more often with real data (when the population standard deviation is unknown).
Chi-Square Distribution
Somewhat skewed , and only allows values above zero .
Used in testing count data .
Binomial Distribution
Appears when you have two possible outcomes , and you are counting how many times each outcome occurred.
(Common example: voting data)
Statistical Distributions
Things you might want to use a Statistical distribution for:
Find the probability of an event
If I flip 10 coins, what are the chances I get all heads?
Estimate a percentage of a population :
About what percent of people are above 6 feet tall?
Quantify the evidence in your data :
In my survey of 100 people, 67 said they were voting for Measure A. How confident am I that Measure A will pass?
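The survey question above can be explored by simulation with rbinom() (a sketch: it asks how often a fifty-fifty electorate would produce a poll result as extreme as 67 out of 100):

```r
set.seed(1)  # arbitrary seed for reproducibility

# 10,000 simulated polls of 100 people, assuming true support is exactly 50%
sims <- rbinom(10000, size = 100, prob = 0.5)

# Proportion of simulated polls with 67 or more "yes" responses
mean(sims >= 67)
```

The proportion is tiny, so a 67/100 result would be very surprising if support were really a coin flip; that is strong evidence Measure A will pass.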
Statistical Distributions in R – r, p, d, q
r stands for random sampling: generate a random value from the distribution.
We typically use this to simulate data; that is, to generate pretend observations to see what happens.
runif(n = 3)
[1] 0.6074404 0.5133956 0.6360152
runif(n = 3, min = 10, max = 20)
[1] 18.42964 19.71224 14.78626
rnorm(n = 3)
[1] 1.1640752 -1.7659451 0.3913231
rnorm(n = 3, mean = -100, sd = 50)
[1] -76.55332 -42.76307 -130.96689
rbinom(n = 3, size = 10, prob = 0.7)
p stands for probability: compute \(P(X < q)\), the area under the curve to the left of q.
pnorm(q = 70, mean = 67, sd = 3)                       # P(X < 70)
1 - pnorm(q = 70, mean = 67, sd = 3)                   # P(X > 70)
pnorm(q = 70, mean = 67, sd = 3, lower.tail = FALSE)   # also P(X > 70)
q stands for quantile : given a probability p, compute the x-value such that \[P(X < x) = p.\]
the q functions are the “backwards” version of the p functions
qnorm(p = 0.95, mean = 67, sd = 3)
d stands for density : compute the height of distribution curve
What is the probability of getting exactly 12 heads in 20 coin tosses, for a coin with a 70% chance of tails?
dbinom(x = 12, size = 20, prob = 0.3)
Simulating Fake Data
Since there is randomness involved, if we want to create a reproducible random sample, we "set the seed" (with set.seed()) so that we get the same random sample each time the code is run.
fake_data <- tibble(
  names   = charlatan::ch_name(1000),
  height  = rnorm(1000, mean = 67, sd = 3),
  age     = runif(1000, min = 15, max = 75),
  measure = rbinom(1000, size = 1, prob = 0.6)
) |>
  mutate(
    supports_measure_A = ifelse(measure == 1, "yes", "no")
  )

head(fake_data)
# A tibble: 6 × 5
names height age measure supports_measure_A
<chr> <dbl> <dbl> <int> <chr>
1 Dr. Dagmar Johns 65.9 31.7 0 no
2 Freddy Donnelly 66.5 24.2 1 yes
3 Mr. Fielding Hackett DDS 71.0 62.9 1 yes
4 Frankie King 64.3 63.0 1 yes
5 Jaslyn Hickle 68.5 66.2 1 yes
6 Dr. Jessee Brekke DDS 71.7 50.0 1 yes
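The "set the seed" step works like this (the seed value 1234 is arbitrary; any fixed integer gives reproducibility):

```r
set.seed(1234)    # fix the random number stream
first <- rnorm(3)

set.seed(1234)    # resetting the seed replays the same stream
second <- rnorm(3)

identical(first, second)
#> [1] TRUE
```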
fake_data |>
  ggplot(aes(y = supports_measure_A,
             x = age,
             fill = supports_measure_A)) +
  ggridges::geom_density_ridges(show.legend = FALSE) +
  scale_fill_brewer(palette = "Paired") +
  theme_bw() +
  labs(x = "Age (years)",
       y = "",
       subtitle = "Support for Measure A")
Plotting distributions
fake_data |>
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, fill = "grey", color = "white") +
  theme_bw()

fake_data |>
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, fill = "grey", color = "white") +
  stat_function(fun = dnorm,
                col = "steelblue",
                lwd = 2)

fake_data |>
  ggplot(aes(x = height)) +
  geom_histogram(bins = 10, fill = "grey", color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3),
                color = "steelblue",
                lwd = 2)

fake_data |>
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = ..density..),
                 bins = 10, fill = "grey", color = "white") +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3),
                color = "steelblue",
                lwd = 2)
Empirical vs Theoretical Distributions
fake_data %>%
  ggplot(aes(x = height)) +
  geom_histogram(aes(y = ..density..), bins = 10)

fake_data %>%
  ggplot(aes(x = height)) +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 3),
                col = "steelblue", lwd = 2) +
  stat_function(fun = ~ dnorm(.x, mean = 67, sd = 2),
                col = "orange2", lwd = 2)
In-line code `r `
When writing up a report that includes results from a random generation process, use in-line r code to include your output within your written description; this ensures reproducibility in your document.
my_rand <- rnorm(1, mean = 0, sd = 1)
my_rand
My random number is 0.4240747.
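In an R Markdown / Quarto source file, that sentence would be written with an in-line chunk: a backtick, r, the code, and a closing backtick. For example:

```markdown
My random number is `r my_rand`.
```

When the document is rendered, the chunk is replaced by the current value of my_rand, so the prose always matches the computed result.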
To do…
PA 9.2: Instrument Con
Due Friday, 3/10 at 8:00am
Lab 9: Baby Names
Due Friday, 3/10 at 11:59pm
Challenge 9: Formatting Nice Tables
Due Saturday, 3/11 at 11:59pm
Project Check-point 2: Linear Regression
Due Sunday, 3/12 at 11:59pm
Read Chapter 10: Predictive Checks and complete Check-in 10.1
Due Monday, 3/13 at 8:00am